- Friday, September 6, 2024
Alibaba Cloud has released Qwen2-VL, a new vision-language model family with enhanced visual understanding, video comprehension, and multilingual text-image processing. Qwen2-VL outperforms models such as Meta's Llama 3.1 and OpenAI's GPT-4o on reported benchmarks and supports applications including real-time video analysis and tech support. The models come in three sizes (2B, 7B, and a forthcoming 72B), with the two smaller variants open-sourced under Apache 2.0.
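As a rough illustration of how the open 2B/7B checkpoints can be used, here is a minimal image-question sketch via Hugging Face transformers. It assumes a transformers release with Qwen2-VL support; the local file `demo.jpg` is a hypothetical input.

```python
# Minimal Qwen2-VL inference sketch (assumes a transformers version with Qwen2-VL support).
import torch
from PIL import Image
from transformers import AutoProcessor, Qwen2VLForConditionalGeneration

model_id = "Qwen/Qwen2-VL-7B-Instruct"
model = Qwen2VLForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_id)

image = Image.open("demo.jpg")  # hypothetical local image
messages = [
    {"role": "user", "content": [
        {"type": "image"},
        {"type": "text", "text": "Describe this image in one sentence."},
    ]}
]
prompt = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(text=[prompt], images=[image], return_tensors="pt").to(model.device)

output_ids = model.generate(**inputs, max_new_tokens=64)
# Strip the prompt tokens before decoding the model's answer.
answer = processor.batch_decode(
    output_ids[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True
)[0]
print(answer)
```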
- Friday, June 7, 2024
Qwen 2 is Alibaba Cloud's new flagship language model. It slightly surpasses Llama 3 70B on English-language benchmarks while being the stronger multilingual model.
- Tuesday, September 24, 2024
Alibaba has released over 100 open-source AI models, enhancing its technology to compete with rivals. The new Qwen 2.5 models, upgraded in math and coding, span applications from automobiles to gaming. Alibaba has also launched a new proprietary model, Qwen-Max 2.5, and a text-to-video tool to strengthen its AI and cloud services offerings.
- Thursday, September 26, 2024
Llama 3.2 has been introduced as a significant advancement in edge AI and vision technology, with a range of open, customizable models for different applications. The release includes vision large language models at 11 billion and 90 billion parameters, as well as lightweight text-only models at 1 billion and 3 billion parameters. The models are optimized for deployment on edge and mobile devices, support a context length of 128,000 tokens, and are suited to tasks such as summarization, instruction following, and rewriting.

The vision models are built for image understanding, providing document-level comprehension, image captioning, and visual grounding. They accept both text and image inputs, enabling complex reasoning over visual data: users can, for example, query the model about sales data shown in a graph or ask for navigational help based on a map. The lightweight models focus on multilingual text generation and tool calling, letting developers build privacy-focused applications that run entirely on-device.

Llama 3.2 is backed by a broad ecosystem, with partnerships including AWS, Databricks, and Qualcomm to ease integration across platforms. The release also includes the Llama Stack, a set of tools that simplifies development across on-premises, cloud, and mobile environments. In evaluations, the models show competitive performance against leading foundation models on both image recognition and language tasks. Architecturally, the vision models add new adapter weights that integrate image processing into the existing language model framework, preserving text-only capabilities while adding visual reasoning.

The release also emphasizes responsible AI development: new safety tools such as Llama Guard filter inappropriate content, and the lightweight models are optimized for efficiency in constrained environments. The models are available for download and immediate development, with a stated commitment to openness, collaboration with partners and the open-source community, and the creation of new generative AI applications.
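To make the image-understanding workflow above concrete, here is a hedged sketch of asking the 11B vision model about a chart via Hugging Face transformers. It assumes a transformers version with Llama 3.2 Vision (Mllama) support, approved access to the gated meta-llama checkpoint, and a hypothetical local file `sales_chart.png`.

```python
# Sketch of image+text inference with Llama 3.2 11B Vision (assumes Mllama support in
# transformers and approved access to the gated meta-llama checkpoint).
import torch
from PIL import Image
from transformers import AutoProcessor, MllamaForConditionalGeneration

model_id = "meta-llama/Llama-3.2-11B-Vision-Instruct"
model = MllamaForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_id)

image = Image.open("sales_chart.png")  # hypothetical chart image
messages = [
    {"role": "user", "content": [
        {"type": "image"},
        {"type": "text", "text": "Summarize the sales trend shown in this chart."},
    ]}
]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
# add_special_tokens=False because the chat template already includes them.
inputs = processor(image, prompt, add_special_tokens=False, return_tensors="pt").to(model.device)

output_ids = model.generate(**inputs, max_new_tokens=128)
print(processor.decode(output_ids[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```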
- Wednesday, October 2, 2024
NVIDIA has introduced NVLM 1.0, a family of advanced multimodal large language models (LLMs) that excel at vision-language tasks, competing with proprietary models such as GPT-4o and open-access models such as Llama 3-V 405B and InternVL 2. The decoder-only NVLM-D-72B model from this release has been open-sourced for community use. Notably, NVLM 1.0 improves on its underlying text-only LLM after multimodal training rather than regressing.

The models were trained with the Megatron-LM framework and adapted for hosting and inference on Hugging Face, allowing reproducibility and comparison with other models. Benchmark results show NVLM-D 1.0 72B achieving strong scores on vision-language benchmarks such as MMMU, MathVista, and VQAv2, competitive with other leading models, while also performing well on text-only benchmarks.

The released model supports efficient loading and inference, including multi-GPU setups. The documentation provides instructions for preparing the environment, loading the model, and running inference, along with code snippets for loading and preprocessing images and interacting with the model. Inference covers both pure-text conversations and image-based interactions such as asking the model to describe an image.

NVLM is a collaborative effort by researchers at NVIDIA. The model is licensed under Creative Commons BY-NC 4.0, allowing non-commercial use, and its release marks a significant step forward for open multimodal AI.
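As an illustration of the Hugging Face adaptation described above, here is a minimal text-only chat sketch. It assumes the nvidia/NVLM-D-72B repository's remote code exposes an InternVL-style `chat()` helper as in its model card, simplifies the multi-GPU placement to `device_map="auto"`, and omits the image-tiling preprocessing needed for multimodal prompts.

```python
# Text-only chat sketch for NVLM-D-72B (assumes the model card's remote-code interface;
# device_map="auto" is a simplification of the card's custom multi-GPU device map).
import torch
from transformers import AutoModel, AutoTokenizer

path = "nvidia/NVLM-D-72B"
model = AutoModel.from_pretrained(
    path,
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
    use_flash_attn=False,   # kwarg from the model card's loading example
    trust_remote_code=True,
    device_map="auto",
).eval()
tokenizer = AutoTokenizer.from_pretrained(path, trust_remote_code=True, use_fast=False)

generation_config = dict(max_new_tokens=256, do_sample=False)

# Pure-text turn: passing None for pixel_values signals a text-only conversation.
question = "Explain the difference between decoder-only and cross-attention VLM designs."
response, history = model.chat(
    tokenizer, None, question, generation_config, history=None, return_history=True
)
print(response)
```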
- Wednesday, April 17, 2024
Vision-language models (VLMs) often struggle to process multiple queries per image and to identify when objects are absent. This study introduces a new query format to tackle these issues and incorporates semantic segmentation into the training process.
- Wednesday, April 3, 2024
The Draw-and-Understand project's SPHINX-V is a multimodal large language model designed to deepen human-AI interaction through visual prompts.
- Wednesday, May 15, 2024
Google released and teased several open-source models at its launch today. One model that actually shipped is PaliGemma, a vision-language model based on SigLIP. It is extremely easy to tune and extend to a variety of tasks, and this Colab Notebook shows how to do so with clean, readable code.
- Tuesday, March 5, 2024
The All-Seeing Project V2 introduces the ASMv2 model, which blends text generation, object localization, and understanding the connections between objects in images.
- Wednesday, March 20, 2024
Researchers have developed a new framework to help vision-language models learn continuously without forgetting previous knowledge using a system that expands the model with special adapters for new tasks.
- Monday, April 1, 2024
xAI announced its next model, Grok-1.5, with a 128k context length and improved reasoning capabilities. It excels at retrieval and programming tasks.
- Tuesday, August 6, 2024
Groq, the lightning-fast AI chip startup, is raising a substantial round to meet demand for large language model inference.
- Friday, May 24, 2024
Google has introduced PaliGemma, an open-source vision-language model that outperforms its contemporaries in object detection and segmentation. Optimized for fine-tuning on specific tasks, PaliGemma opens possibilities for custom AI applications and ships with comprehensive resources for immediate use. It also delivers strong OCR results and shows promise for a range of use cases when fine-tuned on custom data.
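As a rough sketch of using the released checkpoints before any fine-tuning, here is an OCR-style inference example via transformers. It assumes access to the gated google/paligemma-3b-mix-224 checkpoint and PaliGemma's task-prefix prompting style ("caption en", "detect ...", "ocr"); `receipt.png` is a hypothetical input image.

```python
# PaliGemma inference sketch (assumes a transformers version with PaliGemma support
# and access to the gated google/paligemma checkpoints on Hugging Face).
import torch
from PIL import Image
from transformers import AutoProcessor, PaliGemmaForConditionalGeneration

model_id = "google/paligemma-3b-mix-224"
model = PaliGemmaForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
).eval()
processor = AutoProcessor.from_pretrained(model_id)

image = Image.open("receipt.png")  # hypothetical image
prompt = "ocr"  # task-prefix prompt; "caption en" or "detect <object>" work similarly
inputs = processor(text=prompt, images=image, return_tensors="pt").to(model.device)

with torch.inference_mode():
    output_ids = model.generate(**inputs, max_new_tokens=64, do_sample=False)

# Decode only the newly generated tokens after the prompt.
print(processor.decode(output_ids[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```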
- Monday, May 13, 2024
Salesforce has trained and released the third non-commercial version of its popular BLIP family, vision-language models used mainly for image understanding and captioning.
- Monday, April 15, 2024
xAI has announced that its latest flagship model, Grok-1.5V, has vision capabilities on par with (and in some cases exceeding) state-of-the-art models.
- Thursday, June 20, 2024
Microsoft has released an MIT-licensed set of small VLMs that dramatically outperform much larger models on captioning, bounding-box object detection, and classification.
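Assuming this release refers to Microsoft's Florence-2 family (small MIT-licensed VLMs published around this time), here is a hedged object-detection sketch via transformers; the task tokens and `post_process_generation` helper follow the pattern in that model's remote code, and `street.jpg` is a hypothetical input.

```python
# Sketch assuming the release is Microsoft's Florence-2 family; it ships custom
# modeling code, hence trust_remote_code=True.
import torch
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

model_id = "microsoft/Florence-2-large"
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16, trust_remote_code=True
).to("cuda")
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)

image = Image.open("street.jpg")  # hypothetical image
task = "<OD>"  # task token: "<CAPTION>" and "<OD>" (object detection) are typical prompts

inputs = processor(text=task, images=image, return_tensors="pt").to("cuda", torch.float16)
output_ids = model.generate(
    input_ids=inputs["input_ids"],
    pixel_values=inputs["pixel_values"],
    max_new_tokens=512,
    num_beams=3,
)
raw = processor.batch_decode(output_ids, skip_special_tokens=False)[0]
# The processor's post-processing parses the raw string into labeled boxes for the task.
result = processor.post_process_generation(raw, task=task, image_size=(image.width, image.height))
print(result)
```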
- Thursday, September 19, 2024
Qwen 2.5 is an impressive array of open models that approach the frontier of performance, with particularly strong results on code, math, structured output, and reasoning. The Qwen team has also released a suite of sizes to cover a variety of use cases.
- Monday, June 24, 2024
NLUX is a conversational AI JavaScript library that provides a UI for large language models, making it simple to integrate them into web apps. NLUX features React components and hooks, LLM adapters, streamed LLM output, and custom renderers.
- Thursday, May 2, 2024
AWS has launched Amazon Q, a generative AI assistant aimed at improving software development and decision-making by leveraging a company's internal data. Amazon Q facilitates coding, testing, and app development for developers, while offering data-driven support for business users through natural language interaction. The service also includes Amazon Q Apps, enabling the creation of custom AI applications without coding expertise.
- Thursday, September 12, 2024
French AI startup Mistral has launched Pixtral 12B, a 12-billion-parameter multimodal model capable of processing both images and text. Available via GitHub and Hugging Face, the model can be fine-tuned and used under an Apache 2.0 license. Its release follows Mistral's $645 million funding round and positions the company as a significant player in Europe's AI landscape.
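As a rough sketch of running the released weights, here is an example assuming vLLM's multimodal chat interface and access to the mistralai/Pixtral-12B-2409 checkpoint on Hugging Face; the image URL is a placeholder.

```python
# Pixtral 12B sketch via vLLM (assumes a vLLM build with Pixtral support and the
# "mistral" tokenizer mode; the image URL below is a placeholder).
from vllm import LLM
from vllm.sampling_params import SamplingParams

llm = LLM(model="mistralai/Pixtral-12B-2409", tokenizer_mode="mistral")

messages = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "Describe this image in one sentence."},
            {"type": "image_url", "image_url": {"url": "https://example.com/photo.jpg"}},
        ],
    }
]

# Chat-style multimodal request; greedy-ish decoding with a modest token budget.
outputs = llm.chat(messages, sampling_params=SamplingParams(max_tokens=256))
print(outputs[0].outputs[0].text)
```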
- Tuesday, July 23, 2024
ElevenLabs has introduced a new model, Turbo 2.5, that unlocks high-quality, low-latency conversational AI for nearly 80% of the world: Hindi, French, Spanish, Mandarin, and 27 other languages. For the first time, it supports Vietnamese, Hungarian, and Norwegian text-to-speech, and English is now 25% faster than with Turbo v2.
- Tuesday, March 19, 2024
Apple researchers have developed methods for training large language models on both text and images, leading to state-of-the-art performance in multimodal AI tasks.
- Tuesday, March 5, 2024
OpenAI introduced a Read Aloud feature for ChatGPT that lets it respond to users aloud in 37 languages with five voice options, enhancing its multimodal capabilities and its usability for on-the-go interactions across web and mobile platforms.
- Monday, April 22, 2024
50 vision/language datasets have been combined into a single format to enable improved model training.
- Thursday, May 30, 2024
Reason3D is a novel multimodal large language model designed for comprehensive 3D environment understanding.
- Wednesday, May 15, 2024
GPT-4o's multimodal abilities, integrating vision and voice, promise significant advances in how AI interacts with the world, paving the way for AI to become a more ubiquitous presence in daily life.
- Thursday, June 20, 2024
Vercel AI SDK 3.2 introduces new features like agent workflows for complex tasks, new model providers, embedding support for semantic search, and improvements in observability and client-side tool calls.
- Tuesday, March 26, 2024
Cerebras' new wafer chip can train 24T parameter language models. It natively supports PyTorch.
- Friday, June 7, 2024
The Together AI team has released a novel VLM that excels at extremely high-resolution images thanks to its efficient architecture.
- Thursday, September 26, 2024
Llama 3.2 is the latest iteration of the open-source AI model family, designed for versatility and efficiency across applications. The release spans 1B, 3B, 11B, and 90B parameter models, covering everything from lightweight mobile use cases to multimodal tasks that combine text and images. The 1B and 3B models are optimized for on-device applications such as summarizing discussions or integrating with tools like calendars, while the 11B and 90B models target more demanding multimodal work, processing high-resolution images and generating relevant text outputs.

The release emphasizes a streamlined developer experience through the Llama Stack, a comprehensive toolchain for building applications. Developers can work in Python, Node, Kotlin, or Swift and deploy across environments including on-premises and edge devices, while a common API improves interoperability, reduces the need for model-level changes, and speeds up integration of new components.

Performance was evaluated across more than 150 benchmark datasets covering both language understanding and visual reasoning, with competitive results against other leading models in real-world scenarios. The Llama ecosystem continues to grow, with over 350 million downloads on platforms like Hugging Face and hardware partners such as ARM, MediaTek, and Qualcomm enabling lightweight models on mobile and edge devices; companies like Dell are integrating Llama Stack into their offerings to promote open models in enterprise settings.

Real-world deployments are already appearing: Zoom has built an AI companion that enhances productivity through chat and meeting summaries, DoorDash uses Llama to streamline internal processes, and KPMG has explored secure open-source LLM options for financial institutions. Overall, Llama 3.2 gives developers powerful, customizable tools while fostering a collaborative community around open-source AI.